Abstract. Redundancy of experimental data is the basic statistic from which the complexity of a natural 
phenomenon and the proper number of experiments needed for its exploration can be estimated. The 
redundancy is expressed by the entropy of information pertaining to the probability density function of 
experimental variables. Since the calculation of entropy is inconvenient due to integration over a range 
of variables, an approximate expression for redundancy is derived that includes only a sum over the set 
of experimental data about these variables. The approximation makes feasible an efficient estimation of 
the redundancy of data along with the related experimental information and information cost function. 
From the experimental information the complexity of the phenomenon can be simply estimated, while 
the proper number of experiments needed for its exploration can be determined from the minimum of 
the cost function. The performance of the approximate estimation of these statistics is demonstrated on 
two-dimensional normally distributed random data. 

PACS. 06.20.DK Measurement and error theory - 02.50.+S Probability theory, stochastic processes, and 
statistics - 89.70.+C Information science 

1 Introduction

The basic task of experimental physical exploration of natural phenomena is to provide quantitative data on measured variables and, from them, extract physical laws [1]. Related to this task, experimenters must decide how many experiments to perform in order to provide proper experimental data. We know that it is reasonable to repeat experiments as long as they yield essentially new data, and to stop repetition when the data become redundant. In order to describe this concept objectively, we have introduced in previous articles [2,3] two statistics called experimental information $I$ and redundancy $R$ of experimental data based on the entropy of information [4]. Their difference $C = R - I$ can be interpreted as the information cost function of the experimental exploration. From the cost function minimum, the proper number $N$ of experiments can be determined in an objective way. The entropy of information is defined by the integral of a nonlinear function of the probability density function of experimental data, and consequently its calculation is numerically demanding. This property represents a serious obstacle, especially when treating multivariate data. Therefore, our aim is to show how this obstacle can be effectively avoided by estimating data redundancy without integration. For this purpose we first briefly repeat the route to the definition of redundancy [2,3] and subsequently show how the integral in the corresponding expression can be approximated. The performance of the derived approximate method of calculation is demonstrated using two-dimensional normally distributed random data.

2 Redundancy of experimental data

Let us consider a phenomenon characterized by $N$ measurements of a variable $x$ using an instrument with span $S_x = (-L, L)$. Properties of the instrument are specified by calibration on a unit $u$. The probability density function (PDF) of the instrument's output scattering during calibration is described by the scattering function $\psi(x, u)$. When the scattering is caused by mutually independent disturbances in the experimental system, the scattering function is Gaussian [1,4]:

$$ \psi(x, u) = g(x - u, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x - u)^2}{2\sigma^2}\right]. \tag{1} $$

We apply this function in our further treatment. The mean value $u$ and standard deviation $\sigma$ can be estimated statistically by repetition of calibration.

Let $x_i$ denote the most probable instrument output in the $i$-th experiment. Using $\psi(x, x_i)$ we describe the properties of the explored phenomenon during the $i$-th experiment. Similarly, the properties in a series of $N$ repeated experiments, which yield the basic data set $\{x_i;\ i = 1, \ldots, N\}$, are described by the experimentally estimated PDF:

$$ f_N(x) = \frac{1}{N} \sum_{i=1}^{N} \psi(x, x_i). \tag{2} $$

In addition, we introduce a uniform reference PDF $p(x) = 1/(2L)$ indicating that all outcomes of the experiment are hypothetically equally probable before executing the experiments.

Based upon functions $f_N(x)$ and $p(x)$ we describe the indeterminacy of variable $x$ by the negative value of the relative entropy [5,6,7]:

$$ H_x = -\int_{S_x} f_N(x) \log\left(\frac{f_N(x)}{p(x)}\right) dx. \tag{3} $$

Similarly, we describe the uncertainty $H_u$ of calibration performed on a unit $u$ by:

$$ H_u = -\int_{S_x} \psi(x, u) \log\left(\frac{\psi(x, u)}{p(x)}\right) dx. \tag{4} $$
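The estimator of Eq. (2) and the two indeterminacies can be checked numerically. What follows is a minimal Python sketch, assuming NumPy, a regular integration grid, and a calibration unit placed at $u = 0$; the function names, the grid size, and the sample data are illustrative assumptions and not part of the original treatment. It evaluates the difference of the two indeterminacies by brute-force integration, which is exactly the step that becomes expensive for multivariate data.

```python
import numpy as np

def gauss(x, center, sigma):
    """Gaussian scattering function psi(x, center) = g(x - center, sigma), Eq. (1)."""
    return np.exp(-(x - center) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

def pdf_estimate(x, data, sigma):
    """Estimated PDF f_N(x), Eq. (2): average of Gaussians centred on the data."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    return gauss(x[:, None], data[None, :], sigma).mean(axis=1)

def trapezoid(y, x):
    """Trapezoidal rule, written out so the sketch does not depend on a NumPy version."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def indeterminacy(f, x, half_span):
    """-integral of f log(f / p) over (-L, L) with uniform reference p = 1 / (2L)."""
    p_ref = 1.0 / (2.0 * half_span)
    safe_f = np.where(f > 0.0, f, 1.0)  # avoids log(0); such points contribute 0
    return -trapezoid(f * np.log(safe_f / p_ref), x)

def information_by_integration(data, sigma, half_span, n_grid=4001):
    """H_x - H_u on an assumed regular grid of n_grid points."""
    x = np.linspace(-half_span, half_span, n_grid)
    h_x = indeterminacy(pdf_estimate(x, data, sigma), x, half_span)
    h_u = indeterminacy(gauss(x, 0.0, sigma), x, half_span)  # u = 0 is an assumption
    return h_x - h_u

if __name__ == "__main__":
    # Three well separated points (distance 12 sigma) and three coincident points.
    print(information_by_integration([-6.0, 0.0, 6.0], sigma=0.5, half_span=10.0))
    print(information_by_integration([1.0, 1.0, 1.0], sigma=0.5, half_span=10.0))
```

For well separated data the difference approaches $\log(N)$, and for coincident data it vanishes, which is a convenient sanity check on any implementation.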
Using the difference of these statistics we define the experimental information:

$$ I = H_x - H_u = -\int_{S_x} f_N(x) \log\big(f_N(x)\big)\, dx + \int_{S_x} \psi(x, u) \log\big(\psi(x, u)\big)\, dx. \tag{5} $$

Using Eq. (2) in this expression we get:

$$ I = \log(N) - \frac{1}{N} \sum_{i=1}^{N} \int_{S_x} \psi(x, x_i) \log\Big(\sum_{j=1}^{N} \psi(x, x_j)\Big)\, dx + \int_{S_x} \psi(x, u) \log\big(\psi(x, u)\big)\, dx. \tag{6} $$

If we express the logarithm in the second term as:

$$ \log\Big(\sum_{j=1}^{N} \psi(x, x_j)\Big) = \log \psi(x, x_i) + \log\Big(1 + \sum_{j \neq i}^{N} \frac{\psi(x, x_j)}{\psi(x, x_i)}\Big) \tag{7} $$

we obtain:

$$ I = \log(N) - \frac{1}{N} \sum_{i=1}^{N} \int_{S_x} \psi(x, x_i) \log\big(\psi(x, x_i)\big)\, dx - \frac{1}{N} \sum_{i=1}^{N} \int_{S_x} \psi(x, x_i) \log\Big(1 + \sum_{j \neq i}^{N} \frac{\psi(x, x_j)}{\psi(x, x_i)}\Big)\, dx + \int_{S_x} \psi(x, u) \log\big(\psi(x, u)\big)\, dx. \tag{8} $$

The second and the fourth term on the right side of this equation together yield 0 and we get:

$$ I = \log(N) - \frac{1}{N} \sum_{i=1}^{N} \int_{S_x} \psi(x, x_i) \log\Big(1 + \sum_{j \neq i}^{N} \frac{\psi(x, x_j)}{\psi(x, x_i)}\Big)\, dx. \tag{9} $$

With the last term we introduce the statistic called redundancy of data:

$$ R = \frac{1}{N} \sum_{i=1}^{N} \int_{S_x} \psi(x, x_i) \log\Big(1 + \sum_{j \neq i}^{N} \frac{\psi(x, x_j)}{\psi(x, x_i)}\Big)\, dx \tag{10} $$

with which we get the basic relation:

$$ I = \log(N) - R. \tag{11} $$

If $|x_i - x_j| \gg \sigma$ for all pairs there is no overlapping of functions $\psi(x, x_i)$, $\psi(x, x_j)$; therefore, the sum in the logarithm is $\approx 0$, and consequently the redundancy is $R \approx 0$. In the opposite case, when $|x_i - x_j| \ll \sigma$, it follows that $\psi(x, x_i) \approx \psi(x, x_j)$. Due to good overlapping in this case, the corresponding term in the expression of $R$ yields $\log(2)/N$ and $R > 0$.

This property indicates that experimental information is increasing with increasing $N$ as $I \approx \log(N)$ if the acquired data are well separated with respect to $\sigma$. However, with an increasing number of data, they are ever more densely distributed, which results in an increasing overlapping of distributions that causes increasing redundancy of measurements. Although the expression in Eq. (10) for redundancy $R$ is rather cumbersome due to the included integral, we expect that $R$ could be estimated without integration by the simpler function of distances between data points. For this purpose we next consider the properties of the scattering function $\psi(x, x_i)$.

If the Gaussian function $\psi(x, x_i) = g(x - x_i, \sigma)$ is considered as an approximation of the delta function $\delta(x - x_i)$, and the logarithm as a slowly changing function, the integration in Eq. (10) can be carried out, which yields for the redundancy the first order approximate expression without the integral:

$$ R_1 \approx \frac{1}{N} \sum_{i=1}^{N} \log\Big\{1 + \sum_{j \neq i}^{N} \frac{\psi(x_i, x_j)}{\psi(x_i, x_i)}\Big\}. \tag{12} $$

If we take into account Eq. (1) we get for the redundancy the following approximate expression that depends only
on standard functions of distances between data points:

$$ R_1 \approx \frac{1}{N} \sum_{i=1}^{N} \log\Big\{1 + \sum_{j \neq i}^{N} \exp\Big[-\frac{(x_i - x_j)^2}{2\sigma^2}\Big]\Big\}. \tag{13} $$

However, this first order approximation is rather rough because the distribution $\psi(x, x_j)$ has the width $\sigma > 0$ and the logarithm in Eq. (10) includes the fraction of functions $\psi(x, x_j)/\psi(x, x_i)$. To proceed to a better approximation, we have examined the case of just two data points, since it mainly determines the property of the redundancy. In this case the integration of the first three terms in a Taylor series expansion of the logarithm yields the second approximation:

$$ R_2 \approx \frac{1}{N} \sum_{i=1}^{N} \log\Big\{1 + \sum_{j \neq i}^{N} \exp\Big[-\frac{(x_i - x_j)^2}{4\sigma^2}\Big]\Big\}, \tag{14} $$

which is obtained from the previous one by merely changing $2\sigma^2 \to 4\sigma^2$. This property indicates that a still better approximation could be obtained by properly adapting $2\sigma^2$ in Eq. (13). For this purpose we have proceeded with numerical investigations which have shown that a nearly optimal approximation is obtained if $2\sigma^2$ in Eq. (13) is replaced by $\approx 5.1\sigma^2$:

$$ R_a \approx \frac{1}{N} \sum_{i=1}^{N} \log\Big\{1 + \sum_{j \neq i}^{N} \exp\Big[-\frac{(x_i - x_j)^2}{5.1\,\sigma^2}\Big]\Big\}. \tag{15} $$

Numerical investigations have further shown that this formula also yields good results in cases with many data points.

Since the integral is excluded from Eq. (15), the redundancy $R$ can be estimated from Eq. (15) with essentially less computational effort than from Eq. (10). This advantage is especially outstanding in a multivariate case where the redundancy is defined by multiple integrals, while in the approximate formula in Eq. (15) only the term $(x_i - x_j)$ in the exponential function has to be replaced by the norm of corresponding vectors. Due to this advantage, it is also reasonable to estimate approximately the experimental information using the basic formula $I = \log(N) - R$. The experimental information $I$ converges with the increasing number of data $N$ to a certain limit value $I_\infty$ from which the complexity of the phenomenon under investigation can be estimated using the formula $K \approx \exp(I_\infty)$ introduced previously. The complexity $K$ indicates how many non-overlapping scattering distributions are needed in the estimator Eq. (2) to describe the PDF of the observed phenomenon.

The information cost function is the difference of the redundancy and experimental information: $C = R - I$. During minimization of this cost, the experimental information provides for a proper adaptation of the PDF estimator to the experimental data, while the redundancy prevents excessive growth of the number of data points. By the position of the cost function minimum we introduce the proper number $N_o$ of the data and the corresponding experiments that are needed to judiciously represent the phenomenon under exploration. By inserting the expression $I = \log(N) - R$ into $C = R - I$, we obtain for the information cost function the formula:

$$ C = 2R - \log(N). \tag{16} $$

Therefore the proper number $N_o$ can also be determined from the approximately estimated redundancy $R_a$. This number roughly corresponds to the ratio between the magnitude of the characteristic region where experimental data appear and the magnitude of the characteristic region covered by the scattering function [2,3].
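The integration-free estimate and the statistics derived from it take only a few lines of code. The following is a minimal Python sketch, assuming NumPy; the function names and the `width_factor` parameter are illustrative choices made here, not notation from the original. A width factor of 2 corresponds to the first order approximation, 4 to the second, and 5.1 to the numerically tuned one, and for multivariate data the squared Euclidean norm replaces the squared difference.

```python
import numpy as np

def approx_redundancy(data, sigma, width_factor=5.1):
    """Redundancy estimate without integration.

    Computes (1/N) sum_i log(1 + sum_{j != i} exp(-|x_i - x_j|^2 / (width_factor * sigma^2))).
    `data` has shape (N,) for one variable or (N, D) for D variables.
    """
    pts = np.asarray(data, dtype=float)
    if pts.ndim == 1:
        pts = pts[:, None]                      # N points with a single component
    diff = pts[:, None, :] - pts[None, :, :]    # all pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)
    kernel = np.exp(-sq_dist / (width_factor * sigma ** 2))
    np.fill_diagonal(kernel, 0.0)               # exclude the j == i term
    return float(np.mean(np.log1p(kernel.sum(axis=1))))

def information_and_cost(data, sigma, width_factor=5.1):
    """Experimental information log(N) - R and cost 2R - log(N) from the estimate."""
    n = len(data)
    r = approx_redundancy(data, sigma, width_factor)
    return float(np.log(n) - r), float(2.0 * r - np.log(n))

if __name__ == "__main__":
    sample = [0.0, 0.3, 2.0, 5.0]               # arbitrary illustrative data
    for factor in (2.0, 4.0, 5.1):
        print(factor, approx_redundancy(sample, sigma=0.5, width_factor=factor))
    print(information_and_cost(sample, sigma=0.5))
```

The pairwise kernel matrix costs on the order of $N^2$ operations regardless of the number of variables, whereas a grid-based integral grows exponentially with dimension.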

3 Numerical examples 

To demonstrate the properties of the approximations $R_1$, $R_2$, $R_a$ let us first consider the case of just two data points separated by a distance $x_1 - x_2$. Fig. 1 shows the dependence of redundancy $R$ on relative distance $d = (x_1 - x_2)/\sigma$ as determined by the integral in Eq. (10) and approximations in Eqs. (13), (14), and (15). Improvement achieved by subsequent steps of approximation and a fairly good agreement between approximation $R_a$ and $R$ calculated by the integral is evident. However, in a case with more data points we can generally expect slightly worse agreement due to overlapping of more than two scattering functions in the sum of the approximation formula in Eq. (15). The performance in such a case is demonstrated in the next example.
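The two-point comparison can be reproduced with a short script. The sketch below is a hedged illustration in Python with NumPy: it evaluates the defining integral of the redundancy on a regular grid with the trapezoidal rule and compares it with the closed-form two-point approximations. The function names, the grid size, and the half-span of 10 are assumptions made for the illustration.

```python
import numpy as np

def redundancy_by_integration(data, sigma, half_span, n_grid=4001):
    """Redundancy of one-dimensional data from the defining integral (trapezoidal rule)."""
    x = np.linspace(-half_span, half_span, n_grid)
    pts = np.asarray(data, dtype=float)
    # log psi(x, x_i) up to the common normalisation constant, shape (N, n_grid)
    log_psi = -(x[None, :] - pts[:, None]) ** 2 / (2.0 * sigma ** 2)
    # log(1 + sum_{j != i} psi_j / psi_i) = log(sum_j psi_j) - log(psi_i), computed stably
    log_ratio = np.logaddexp.reduce(log_psi, axis=0)[None, :] - log_psi
    psi = np.exp(log_psi) / (np.sqrt(2.0 * np.pi) * sigma)
    integrand = psi * log_ratio
    per_point = np.sum(0.5 * (integrand[:, 1:] + integrand[:, :-1]) * np.diff(x), axis=1)
    return float(per_point.mean())

def redundancy_two_points(d, width_factor):
    """Two-point approximation log(1 + exp(-d^2 / width_factor)), d = (x_1 - x_2) / sigma."""
    return float(np.log1p(np.exp(-d ** 2 / width_factor)))

if __name__ == "__main__":
    sigma = 0.5
    for d in (0.0, 1.0, 2.0, 4.0):
        exact = redundancy_by_integration([-0.5 * d * sigma, 0.5 * d * sigma], sigma, 10.0)
        approximations = [redundancy_two_points(d, w) for w in (2.0, 4.0, 5.1)]
        print(d, exact, approximations)
```

At $d = 0$ all four values equal $\log(2)$, and for large $d$ they all tend to zero; the approximations differ only in how quickly they fall off in between.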

In order to provide for reproduction of the demonstrated example, we consider a two-dimensional Gaussian random phenomenon with zero mean value. The standard deviation of both components is equal to $s = 2.5$, while their covariance is zero. The data generated by a standard Gaussian generator are represented in the two-dimensional span $(-10, +10) \otimes (-10, +10)$ using the scattering width $\sigma = 0.5$. In such a case we can theoretically predict that the proper number of data samples should be $N \approx (s/\sigma)^2 = 25$.
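A numerical experiment of this kind can be sketched as follows in Python with NumPy. The random seed, the function name `cost_curve`, and the use of NumPy's `default_rng` generator are assumptions of this sketch, so the resulting minimum will generally differ from the value obtained with any other data set. The integration-free redundancy with the factor 5.1 is evaluated on the first $N$ samples for every $N$, and the cost $2R - \log(N)$ is minimized over $N$.

```python
import numpy as np

def cost_curve(points, sigma, width_factor=5.1):
    """Redundancy, information and cost for the first N points, N = 1 .. len(points)."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    kernel = np.exp(-np.sum(diff ** 2, axis=-1) / (width_factor * sigma ** 2))
    np.fill_diagonal(kernel, 0.0)                       # exclude j == i
    counts = np.arange(1, len(pts) + 1)
    # The leading n-by-n block of the kernel matrix holds the pairs among the first n points.
    redundancy = np.array([np.mean(np.log1p(kernel[:n, :n].sum(axis=1))) for n in counts])
    information = np.log(counts) - redundancy
    cost = 2.0 * redundancy - np.log(counts)
    return counts, redundancy, information, cost

if __name__ == "__main__":
    rng = np.random.default_rng(0)                      # arbitrary seed (assumption)
    samples = rng.normal(loc=0.0, scale=2.5, size=(100, 2))
    counts, redundancy, information, cost = cost_curve(samples, sigma=0.5)
    n_proper = int(counts[np.argmin(cost)])
    print("cost minimum at N =", n_proper)
    print("theoretical estimate (s / sigma)^2 =", (2.5 / 0.5) ** 2)
```

Because the minimum of the cost curve is shallow, its position fluctuates noticeably from one data set to another; averaging over many seeds gives a more stable picture than a single run.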




Fig. 1. Dependence of redundancy $R$ on relative distance $d = (x_1 - x_2)/\sigma$ between data points as determined by the integral in Eq. (10) and approximations in Eqs. (13), (14), and (15).




Fig. 2. PDF determined by 100 data points $x_i, y_i$.

For the demonstration, a set of $N_{max} = 100$ two-dimensional data samples $\{(x_i, y_i);\ i = 1 \ldots N_{max}\}$ was generated. The corresponding probability density function was estimated using Eq. (2) adapted to the two-dimensional case with statistically independent components:

$$ f_N(x, y) = \frac{1}{N} \sum_{i=1}^{N} \psi(x, x_i)\, \psi(y, y_i). \tag{17} $$

The resulting PDF with $N = 100$ is graphically represented in Fig. 2.
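The two-dimensional estimator is a sum of products of one-dimensional Gaussians, so it can be evaluated on a grid with a single matrix product. The following is a minimal Python sketch, assuming NumPy; the function names, the grid, and the three sample points are hypothetical choices used only to illustrate the calculation.

```python
import numpy as np

def gauss(x, center, sigma):
    """Gaussian scattering function g(x - center, sigma)."""
    return np.exp(-(x - center) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

def pdf_estimate_2d(x_grid, y_grid, samples, sigma):
    """f_N(x, y) = (1/N) sum_i psi(x, x_i) psi(y, y_i) on the grid x_grid x y_grid.

    Returns an array of shape (len(x_grid), len(y_grid)).
    """
    samples = np.asarray(samples, dtype=float)
    px = gauss(np.asarray(x_grid, dtype=float)[:, None], samples[None, :, 0], sigma)  # (nx, N)
    py = gauss(np.asarray(y_grid, dtype=float)[:, None], samples[None, :, 1], sigma)  # (ny, N)
    return px @ py.T / len(samples)             # sums the products over the samples

if __name__ == "__main__":
    grid = np.linspace(-10.0, 10.0, 201)
    samples = [[0.0, 0.0], [1.0, -2.0], [-3.0, 2.5]]    # hypothetical data
    f = pdf_estimate_2d(grid, grid, samples, sigma=0.5)
    step = grid[1] - grid[0]
    print("integral over the span:", f.sum() * step * step)
```

Checking that the estimate integrates to approximately one over the span is a quick way to confirm that the samples lie well inside it.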
[Plot: redundancy versus number of data]

Fig. 3. Dependence of redundancy $R$ on number $N$ of data points as determined by the integral in Eq. (10) ($R$), and approximation in Eq. (15) ($R_a$) adapted to the two-dimensional case.

[Plot: information statistics versus number of data]

Fig. 4. Dependence of information statistics on the number $N$ of data points as approximately determined from Eq. (15). The minimum of the cost function occurs at $N = 28$.



From the generated data the redundancy was calculated using Eqs. (10) and (15) adapted to the two-dimensional case. The dependence of redundancy $R$ on the number $N$ of accounted data points is shown in Fig. 3. Fairly good agreement between both statistics is again evident.

Approximately estimated redundancy was further utilized in the calculation of statistics $I$ and $C$. They are shown as functions of the number of data points $N$ in Fig. 4 together with $R(N)$ and $\log(N)$. Agreement with the same statistics calculated more exactly by integration can be established by comparing this figure with Fig. 5. In both cases we obtain for the proper number the value $N_o = 28$. This value depends on the statistical properties of the data set used in its calculation; a statistical estimation from 100 different data sets yields the estimate $N \approx 25 \pm 13$ which agrees well with the theoretically predicted value $N_o = 25$. Similarly as in the one-dimensional case it turns out that the function $f_N(x, y)$ is only a rough estimator of the hypothetical PDF. This property is a consequence of the fact that experimental information $I$ and redundancy $R$ have equal weights in the cost function $C = R - I$.

Fig. 5. Dependence of information statistics on the number $N$ of data points as determined based on integration.
Figs. 4 and 5 indicate that experimental information $I$ converges with increasing $N$ to a certain limit value from which the complexity of the phenomenon under investigation can be approximately estimated as $K \approx \exp(I_{N_{max}})$. In our case we get the estimate $K \approx 21$. The number of non-overlapping scattering distributions that represent the PDF of the observed phenomenon is thus slightly smaller than the proper number $N_o$ of experiments needed for its exploration.

4 Conclusions 

From the statistics introduced in the previous articles [2,3] based on information entropy, we have here derived an approximate formula for the calculation of redundancy $R$ of experimental data. It is important that this formula does not include the integral by which the information entropy is defined. This makes feasible a simplified and fairly good estimation of redundancy and, with it, the related experimental information and cost function. The advantage of the approximate calculation becomes outstanding in multivariate cases because multiple integration is not needed there. A serious obstacle for the application of the concept of experimental information and redundancy of data can thus be avoided. Efficient estimation of the experimental information and cost function, and with them the determined complexity of the phenomenon and the proper number of experiments needed for its exploration, could be considered valuable in planning experimental work. In addition, the complexity $K$ or the proper number $N_o$ could be applied in the field of neural networks [1,8] to determine the appropriate number of cells needed to deal with a certain phenomenon.